8 research outputs found

    A comparison of machine learning and rule-based approaches for text mining in the archaeology domain, across three languages

    Archaeology is a destructive process in which the evidence primarily becomes written documentation. As such, the archaeological domain creates huge amounts of text, from books and scholarly articles to unpublished ‘grey literature’ fieldwork reports. The number of archaeological investigations is increasing significantly, and easy access to the information hidden in these texts is a substantial problem for the archaeological field, one that was identified as early as 2005 (Falkingham 2005). In the Netherlands alone, it is estimated that 4,000 new grey literature reports are created each year, in addition to numerous books, papers and monographs. Furthermore, as research, such as desk-based assessments, is increasingly carried out remotely online, these documents need to be made more easily Findable, Accessible, Interoperable and Reusable (FAIR). Making these documents searchable and analysing them is a time-consuming task when done by hand, and often lacks consistency. Text mining provides methods for disclosing information in large text collections, allowing researchers to locate (parts of) texts relevant to their research questions and to identify patterns of past behaviour in these reports. Furthermore, it enables resources to be searched in meaningful ways, using semantically interoperable vocabularies and domain ontologies to answer questions about what, where and when. The EXALT project at Leiden University is creating a semantic search engine for archaeology in and around the Netherlands, indexing all available open-access texts, which include Dutch-, English- and German-language documents. In this context, we are systematically researching and comparing different methods for extracting information from archaeological texts in these three languages. The specific task we focus on is Named Entity Recognition (NER): finding and recognising certain concepts in text, e.g. artefacts, time periods and places. In the archaeology domain, entity recognition is particularly specialised and determined by domain semantics that pose challenges to conventional NER. We develop text mining applications tailored to the archaeological domain, and in this process we compare a rule-based, knowledge-driven approach (using GATE), a ‘traditional’ machine learning method (Conditional Random Fields), and a deep learning method (BERT). Previous studies have investigated different applications of text mining in archaeological literature (Richards et al. 2015), but this work has often been carried out at a relatively small scale, in isolated case studies, or as proof-of-concept work. With this study, we compare multiple methods in multiple languages, and we aim to contribute to guidelines and good practice for text mining in archaeology. Specifically, we compare not only the overall accuracy of each approach, but also the time, digital literacy, hardware, and labelled data needed to run each method. We also pay attention to the energy usage and CO2 output of these machine learning models and their impact on climate change, something that is particularly pertinent during the ongoing energy crisis. Besides these more practical aspects, we also aim to describe some general properties of the way we write about archaeology, and how writing in a particular language can make knowledge transfer (and, by extension, NER) easier or more difficult.
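
    As a concrete illustration of the ‘traditional’ machine learning method mentioned above, the following is a minimal sketch of a Conditional Random Fields tagger built with the sklearn-crfsuite library. The feature set, the Dutch example sentence and the BIO labels are assumptions made for illustration; they are not the features or training data used in the EXALT project.

```python
# Minimal CRF sketch for archaeological NER, assuming sklearn-crfsuite is installed.
import sklearn_crfsuite


def token_features(tokens, i):
    """Simple per-token features: the word itself, its shape, and its neighbours."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# Toy training data: one sentence with hypothetical BIO labels for an
# artefact ("scherf" / sherd) and a time period ("Romeinse tijd").
sentences = [["Een", "scherf", "uit", "de", "Romeinse", "tijd"]]
labels = [["O", "B-ARTEFACT", "O", "O", "B-PERIOD", "I-PERIOD"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
y = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # predicted label sequence for the training sentence
```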

    Can BERT Dig It? Named Entity Recognition for Information Retrieval in the Archaeology Domain

    The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection (∼658 million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities. In this paper, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model's quality and output on a Named Entity Recognition task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and combining the best BERT model with a domain thesaurus using Conditional Random Fields (CRF). We find that ArcheoBERTje significantly outperforms both the multilingual and the Dutch model, with a smaller standard deviation between runs, reaching an average F1 score of 0.735. The model also outperforms ensemble methods combining the three models. Combining ArcheoBERTje predictions with explicit domain knowledge from the thesaurus did not increase the F1 score. We quantitatively and qualitatively analyse the differences between the vocabulary and output of the BERT models on the full collection and provide valuable insights into the effect of fine-tuning for specific domains. Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model's quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary.
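
    To make the fine-tuning setup described above more concrete, here is a hedged sketch of loading a Dutch BERT model for token classification with the Hugging Face transformers library. The model name (a generic Dutch BERT used as a stand-in), the label set and the example sentence are assumptions for illustration; the paper's actual ArcheoBERTje pre-training and fine-tuning pipeline, hyperparameters and label scheme may differ.

```python
# Sketch of BERT-based token classification for archaeological NER.
# Assumes the transformers and torch packages are installed.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-ARTEFACT", "I-ARTEFACT", "B-PERIOD", "I-PERIOD"]  # illustrative label set
model_name = "GroNLP/bert-base-dutch-cased"  # generic Dutch BERT as a stand-in

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Tokenise one example sentence and run a forward pass. In practice the model
# would first be fine-tuned on BIO-labelled archaeological reports; without
# fine-tuning, the freshly initialised classification head predicts noise.
inputs = tokenizer("Een scherf uit de Romeinse tijd", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)[0].tolist()
print([labels[i] for i in pred_ids])  # one label per subword token (incl. [CLS]/[SEP])
```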

    User Requirement Solicitation for an Information Retrieval System Applied to Dutch Grey Literature in the Archaeology Domain

    In this paper, we present the results of user requirement solicitation for a search system for grey literature in archaeology, specifically Dutch excavation reports. This search system uses Named Entity Recognition and Information Retrieval techniques to create an effective and effortless search experience. Specifically, we used Conditional Random Fields to identify entities, with an average accuracy of 56%. This is a baseline result, and we identified many possibilities for improvement. These entities were indexed in Elasticsearch, and a user interface was developed on top of the index. This proof of concept was used in user requirement solicitation and evaluation with a group of end users. Feedback from this group indicated that there is a dire need for such a system, and that the first results are promising.
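
    As a rough sketch of how the recognised entities could sit behind such a search interface, the snippet below indexes one hypothetical report, together with its extracted entities, into Elasticsearch using the official Python client. The index name, field layout and example document are assumptions for illustration, not the schema of the proof of concept described above.

```python
# Sketch: index a report with its extracted entities and query by time period.
# Assumes an Elasticsearch instance is reachable at localhost:9200 and the
# elasticsearch Python client is installed.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "title": "Opgraving Leiden-Roomburg",           # hypothetical report title
    "text": "Een scherf uit de Romeinse tijd ...",   # report body (excerpt)
    "entities": {
        "artefacts": ["scherf"],
        "periods": ["Romeinse tijd"],
    },
}

# Store the document; searches can then filter on the entity fields.
es.index(index="archaeology-reports", id="report-0001", document=doc)

# Example query: all reports mentioning a given time period.
response = es.search(
    index="archaeology-reports",
    query={"match": {"entities.periods": "Romeinse tijd"}},
)
print(response["hits"]["total"])
```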